A NOTE ON INSUFFICIENCY AND THE
PRESERVATION OF FISHER INFORMATION

By David Pollard 

Yale University 

Kagan and Shepp (2005) presented an elegant example of a mixture
model for which an insufficient statistic preserves Fisher information.
This note uses the regularity property of differentiability in
quadratic mean to provide another explanation for the phenomenon
they observed. Some connections with Le Cam's theory for convergence
of experiments are noted.



1. Introduction. Suppose $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a statistical
experiment, a set of probability measures on some measure space
$(\mathcal{X}, \mathcal{A})$ indexed by a subset $\Theta$ of the real line.

The Fisher information function $I_{\mathcal{P}}(\theta)$ can be defined under various
regularity conditions. If $S$ is a measurable map from $\mathcal{X}$ into another measure
space $(\mathcal{Y}, \mathcal{B})$, each image measure $Q_\theta = SP_\theta$ (often called the distribution
of $S$ under $P_\theta$, and sometimes denoted by $P_\theta S^{-1}$) is a probability measure
on $\mathcal{B}$. The statistical experiment $\mathcal{Q} = \{Q_\theta : \theta \in \Theta\}$ is less informative, in
the sense that an observation $y \sim Q_\theta$ tells us less about $\theta$ than an
observation $x \sim P_\theta$. In particular, $I_{\mathcal{Q}}(\theta) \le I_{\mathcal{P}}(\theta)$ for every $\theta$. If $S$ is a sufficient
statistic the last inequality becomes an equality: there is no loss of Fisher
information.

Statistical folklore holds that the converse is also true. For example,
Lehmann and Casella (1998, page 158) set as an exercise the task of
verifying, "under suitable regularity conditions", results stated by Basu (1964,
Section 1), including the assertion that there is no loss of Fisher information
if and only if the statistic is sufficient. They interpreted Basu's (unstated)
regularity conditions to be "mainly concerned with interchange of
integration with differentiation". Nevertheless, Kagan and Shepp (2005)
(henceforth K&S) were able to show, by means of a simple example, that it is
possible to have $I_{\mathcal{Q}}(\theta) = I_{\mathcal{P}}(\theta)$ for every $\theta$ without $S$ being sufficient. The
K&S counterexample relies on another property (the support of a density
changing with the unknown parameter) that is notorious for upsetting
classical statistical theory.

My purpose in this note is to make two small additions to the K&S 
analysis. 

(i) I reinterpret the phenomenon identified by K&S, using the geometry
of differentiability in quadratic mean.
(ii) Using an asymptotic argument, I offer an explanation for why the
extent of the failure of sufficiency in the K&S example is too small
to be captured by the Fisher information. More precisely, I explain
why the experiment $\mathcal{Q}_n$ obtained by $n$ independent replications of $\mathcal{Q}$ is
asymptotically equivalent (in Le Cam's sense) to the corresponding $\mathcal{P}_n$.

Most of the necessary theory is already available in the literature but is 
not widely known. The K&S example provides a good showcase for that 
theory. 

2. The K&S example. What follows is a slightly simplified version of 
the K&S construction. 

Start from a smooth probability density

$$g(w) = \tfrac{1}{2} w^2 e^{-w} \{w > 0\}$$

with respect to Lebesgue measure $m$ on the real line. The power $w^2$ is chosen
so that

$$\frac{g'(w)^2}{g(w)} = g(w)\left(\frac{d \log g(w)}{dw}\right)^2 = \tfrac{1}{2}(2 - w)^2 e^{-w} \{w > 0\}$$

is Lebesgue integrable. The shift family of densities $\{g(w - \theta) : \theta \in \mathbb{R}\}$ has
constant Fisher information,

(1) $$I = \int g'(w)^2 / g(w)\, dw < \infty.$$
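
The integral in (1) can be evaluated in closed form for this $g$: since
$\int_0^\infty (2-w)^2 e^{-w}\,dw = 4 - 4 + 2 = 2$, the constant is $I = 1$. The following
sketch (my own check, not part of the K&S construction) confirms the value
numerically with a hand-rolled Simpson rule; the truncation point 60 and the
helper names are arbitrary choices made for illustration.

```python
import math

def fisher_integrand(w):
    """g'(w)^2 / g(w) = 0.5 * (2 - w)^2 * exp(-w) for w > 0."""
    return 0.5 * (2.0 - w) ** 2 * math.exp(-w)

def simpson(f, a, b, m=20000):
    """Composite Simpson rule on [a, b] with m (even) subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

# Truncating at w = 60 discards a tail of order exp(-60).
fisher_info = simpson(fisher_integrand, 0.0, 60.0)
print(fisher_info)  # should agree with the closed-form value 1 up to quadrature error
```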

Let $\nu$ denote the probability measure that puts mass 1/2 at each of $+1$
and $-1$. For each $\theta \in \Theta = \mathbb{R}$ define a probability measure $P_\theta$ on (the Borel
sigma-field of) $\mathcal{X} = \mathbb{R} \times \{-1, +1\}$ by means of its density

(2) $$f_\theta(x) = \{z = +1\}\, g(y - \theta) + \{z = -1\}\, g(\theta - y) \qquad \text{where } x = (y, z) \in \mathcal{X}$$

with respect to the measure $\lambda := m \otimes \nu$. That is, the coordinate $z$ has
marginal distribution $\nu$ and the conditional distribution of $y$ given $z$ is that
of $\theta + zw$ where $w \sim g$ independently of $z$.
[Figure: the two branches of the density $f_\theta$, namely $g(y - \theta)$ for $z = +1$
and $g(\theta - y)$ for $z = -1$, meeting at $y = \theta$.]


Here are the pertinent statistical facts for a single observation $x = (y, z)$
from $P_\theta$. (See Section 3 for some proofs.) Define $S(x) = y$ and $A(x) = z$.
The marginal distribution $Q_\theta$ of $S$ has density

$$h_\theta(y) = \tfrac{1}{2} g(y - \theta) + \tfrac{1}{2} g(\theta - y) \qquad \text{with respect to } m.$$

(a) The statistic $A$ is ancillary (its distribution, $\nu$, does not depend on $\theta$).
By itself it gives no statistical information about $\theta$, but in conjunction
with $S$ it does tell us something about $\theta$: if $z = +1$ then (with $P_\theta$
probability one) $\theta < y$, and if $z = -1$ then $y < \theta$.

(b) The statistic $S$ is not sufficient because $\{z = +1\} = \{y > \theta\}$ a.s.$[P_\theta]$,
implying $P_\theta\{z = +1 \mid y\} = \{y > \theta\}$ almost surely. Equivalently,

$$P_\theta\{A(x) = 1 \mid S\} = \{S(x) > \theta\} \qquad \text{a.s.}[P_\theta],$$

which depends on $\theta$. (More formally, if $S$ were sufficient there would
exist some measurable function $\pi(y)$ for which $P_\theta\{z = 1 \mid S\} = \pi(S(x))$
a.s.$[P_\theta]$, for every $\theta$.)

(c) Both $\mathcal{P} := \{P_\theta : \theta \in \mathbb{R}\}$ and $\mathcal{Q} := \{Q_\theta : \theta \in \mathbb{R}\}$ have finite Fisher
information: $I_{\mathcal{Q}}(\theta) = I_{\mathcal{P}}(\theta) = I$ for all $\theta$, with $I$ as in (1).

In short: There is no loss of Fisher information when only $S(x)$ is observed,
even though $S$ is not a sufficient statistic.
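
Facts (a) and (b) are easy to see in simulation. The sketch below is only an
illustration under my own choices (the seed, the sample size, and the value
$\theta = 1.5$ are arbitrary); it uses the fact that $g$ is the Gamma density with
shape 3 and scale 1, so that `random.gammavariate(3.0, 1.0)` draws $w \sim g$.

```python
import random

def sample_x(theta, rng):
    """Draw x = (y, z) from P_theta: z = +/-1 with probability 1/2 each, y = theta + z*w."""
    z = rng.choice((-1, +1))
    w = rng.gammavariate(3.0, 1.0)  # g(w) = 0.5 * w^2 * exp(-w) is Gamma(shape 3, scale 1)
    return theta + z * w, z

rng = random.Random(0)   # fixed seed, chosen arbitrarily
theta = 1.5              # illustrative parameter value
draws = [sample_x(theta, rng) for _ in range(10_000)]

# Fact (b): z = +1 exactly when y > theta, so z is a function of y once theta is known.
z_matches_side = all((z == +1) == (y > theta) for y, z in draws)
# Fact (a): on its own, z behaves like a fair coin whatever theta is.
frac_plus = sum(z == +1 for _, z in draws) / len(draws)
print(z_matches_side, frac_plus)
```

The conditional law of $z$ given $y$ is degenerate but its support depends on
$\theta$, which is exactly the failure of sufficiency described in (b).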

Remark. K&S used a slightly more involved construction, with density

$$f(x, \theta) = \{z = +1\}\,[0.7\, g(y - \theta) + 0.3\, g(\theta - y)] + \{z = -1\}\,[0.3\, g(y - \theta) + 0.7\, g(\theta - y)] \qquad \text{where } x = (y, z) \in \mathcal{X}$$

with respect to $m \otimes \mu$ where $\mu\{+1\} = \alpha = 1 - \mu\{-1\}$ and $\alpha \ne 1/2$. The
analysis that I present can be extended to this $f_\theta$.

3. DQM interpretation. K&S attributed the phenomenon in their 
version of the example in Section 2 to a failure of strict convexity of Fisher 



imsart-aos ver. 2011/11/15 file: pollard_april2012.tex date: April 23, 2012 




information with respect to mixtures of statistical experiments. There is 
another explanation involving the geometry of Hellinger derivatives, which 
I find more illuminating. 

By a theorem of Hajek (1972, Lemma A.3), Lebesgue integrability of the
function $g'^2/g$ in (1) implies that the set of densities $\mathcal{G} := \{g(y - \theta) : \theta \in \mathbb{R}\}$
(with respect to Lebesgue measure) is Hellinger differentiable with Hellinger
derivative $\gamma(y - \theta)$ at $\theta$, where

$$\gamma(w) := -\frac{g'(w)}{2\sqrt{g(w)}} = \frac{(w - 2)\, e^{-w/2}}{2\sqrt{2}} \{w > 0\}.$$

That is,

$$\int \left| \sqrt{g(y - \theta - t)} - \sqrt{g(y - \theta)} - t\gamma(y - \theta) \right|^2 dy = o(t^2) \qquad \text{as } t \to 0.$$

This assertion is also easy to check by explicit calculations. (See Lehmann
and Romano (2005, Cor. 12.2.1) for details.)
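
For readers who prefer a numerical check to the explicit calculation, the
following sketch evaluates the Hellinger remainder for the shift family at
$\theta = 0$ and $t > 0$, divided by $t^2$. It assumes the closed forms
$\sqrt{g(w)} = w e^{-w/2}/\sqrt{2}$ and $\gamma(w) = (w-2)e^{-w/2}/(2\sqrt{2})$ on $w > 0$; the quadrature
rule, grid size, and truncation point are my own choices.

```python
import math

SQRT2 = math.sqrt(2.0)

def sqrt_g(w):
    """Square root of g(w) = 0.5 * w^2 * exp(-w), zero off the positive half-line."""
    return w * math.exp(-w / 2.0) / SQRT2 if w > 0 else 0.0

def hellinger_deriv(w):
    """gamma(w) = -g'(w) / (2 sqrt(g(w))); taken right-continuous at w = 0."""
    return (w - 2.0) * math.exp(-w / 2.0) / (2.0 * SQRT2) if w >= 0 else 0.0

def simpson(f, a, b, m=4000):
    """Composite Simpson rule on [a, b] with m (even) subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def remainder_over_t2(t, upper=60.0):
    """Integral of |sqrt g(w - t) - sqrt g(w) - t*gamma(w)|^2 dw, divided by t^2 (t > 0)."""
    def sq_remainder(w):
        r = sqrt_g(w - t) - sqrt_g(w) - t * hellinger_deriv(w)
        return r * r
    # Split at w = t, where sqrt_g(w - t) has a kink; the integrand vanishes for w < 0.
    total = simpson(sq_remainder, 0.0, t) + simpson(sq_remainder, t, upper)
    return total / (t * t)

ratios = {t: remainder_over_t2(t) for t in (0.1, 0.01, 0.001)}
for t, ratio in ratios.items():
    print(t, ratio)  # the ratio should shrink roughly in proportion to t
```

The dominant contribution comes from the interval $(0, t)$, where the shifted
density vanishes; it is of order $t^3$, which is why the ratio decays linearly.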

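The family of densities $\mathcal{F} := \{f_\theta(x) : \theta \in \mathbb{R}\}$, for $f_\theta$ as in (2), inherits the
Hellinger differentiability from $\mathcal{G}$:

(3) $$\int \left| \sqrt{f_{\theta+t}(x)} - \sqrt{f_\theta(x)} - t\zeta_\theta(x) \right|^2 \lambda(dx) = o(t^2) \qquad \text{as } t \to 0,$$

for the Hellinger derivative

$$\zeta_\theta(x) := \{z = +1\}\,\gamma(y - \theta) - \{z = -1\}\,\gamma(\theta - y).$$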

The significance of approximation (3) becomes clearer when it is rewritten
as a differentiability property of the likelihood ratios. That is, it helps
to work with the square root of the density of $P_{\theta+t}$ with respect to $P_\theta$.
Unfortunately, $P_{\theta+t}$ is not dominated by $P_\theta$. In general, to eliminate such an
embarrassment one needs to split $P_{\theta+t}$ into a singular part $P^\perp_{t,\theta}$, which
concentrates on a set of zero $P_\theta$ measure, plus an absolutely continuous part
that has a density $p_{t,\theta}$ with respect to $P_\theta$. For reasons related to the
asymptotic theory for repeated sampling, it is customary to make a small extra
assumption about the behavior of $P^\perp_{t,\theta}\mathcal{X}$ as $t$ tends to zero. Following
Le Cam (1986, Section 17.3) and Le Cam and Yang (2000, Section 7.2), I will call
the slightly stronger property differentiability in quadratic mean (DQM), to
stress that the definition requires a little more than Hellinger differentiability.

Remark. Some authors (for example, Bickel et al. 1993, page 457) use 
the term DQM as a synonym for Hellinger differentiability. 

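Definition 4. Say that $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, with $\Theta \subseteq \mathbb{R}$, is differentiable
in quadratic mean (DQM) at $\theta$ with score function $\Delta_\theta(x)$ if, for $\theta + t \in \Theta$,

(i) for the part $P^\perp_{t,\theta}$ of $P_{\theta+t}$ that is singular with respect to $P_\theta$,

$$P^\perp_{t,\theta}(\mathcal{X}) = o(t^2) \qquad \text{as } t \to 0;$$

(ii) $\Delta_\theta \in \mathcal{L}^2(P_\theta)$;

(iii) the absolutely continuous part of $P_{\theta+t}$ has density $p_{t,\theta}(x)$ with respect
to $P_\theta$ for which

$$\sqrt{p_{t,\theta}(x)} = 1 + \tfrac{1}{2} t \Delta_\theta(x) + r_{t,\theta}(x) \qquad \text{with } P_\theta\left[ r_{t,\theta}^2 \right] = o(t^2) \text{ as } t \to 0.$$

Call $\mathcal{P}$ DQM if it is DQM at each $\theta$ in $\Theta$.

Remark. The factor of 1/2 in requirement (iii) ensures that $P_\theta \Delta_\theta^2$ is
equal to the Fisher information $I_{\mathcal{P}}(\theta)$ if the densities are suitably smooth
in a pointwise sense.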

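The $\mathcal{P}$ from Section 2 is, in fact, DQM. For $t > 0$ the singular part $P^\perp_{t,\theta}$
has density $\{z = -1\}\, g(\theta + t - y)\, \{\theta < y < \theta + t\}$ with respect to $\lambda$, so that
$P^\perp_{t,\theta}(\mathcal{X}) = O(|t|^3)$. The part of $P_{\theta+t}$ that is dominated by $P_\theta$ has density

$$p_{t,\theta}(x) = \frac{f_{\theta+t}(x)}{f_\theta(x)} \{f_\theta(x) > 0\}.$$

There is a similar expression for the case $t < 0$. The score function equals

(6) $$\Delta_\theta(x) = \frac{2\zeta_\theta(x)}{\sqrt{f_\theta(x)}} \{f_\theta(x) > 0\} = \{z = +1\} \frac{2\gamma(y - \theta)}{\sqrt{g(y - \theta)}} - \{z = -1\} \frac{2\gamma(\theta - y)}{\sqrt{g(\theta - y)}}.$$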
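
The $O(|t|^3)$ rate for the singular mass, which is what makes requirement (i)
of Definition 4 hold with room to spare, can be checked from the closed-form
integral of $g$. This is a sketch under the Section 2 assumptions: the singular
mass is half the $g$-mass of the interval $(0, |t|)$, and the function names are
mine.

```python
import math

def g_mass_below(t):
    """Closed-form integral of g(w) = 0.5 * w^2 * exp(-w) over (0, t), for t >= 0."""
    return 1.0 - math.exp(-t) * (1.0 + t + 0.5 * t * t)

def singular_mass(t):
    """Total mass of the part of P_{theta+t} that is singular with respect to P_theta."""
    # Only one of the two z-branches contributes, and it carries nu-weight 1/2.
    return 0.5 * g_mass_below(abs(t))

for t in (0.1, 0.01, 0.001):
    # First ratio tends to 0 (so the mass is o(t^2)); second tends to 1/12.
    print(t, singular_mass(t) / t**2, singular_mass(t) / abs(t)**3)
```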

The density $p_{t,\theta}$ and the score function $\Delta_\theta(x)$ are uniquely determined
only up to a $P_\theta$ equivalence. As noted for fact (b) near the end of Section 2,
sufficiency fails for $S$ because

$$\{z = +1\} = \{y > \theta\} \qquad \text{a.s.}[P_\theta].$$

Similarly $\{z = -1\} = \{y < \theta\}$ a.s.$[P_\theta]$. Together these two equivalences
explain why no Fisher information is lost. The score function $\Delta_\theta$ is changed
only on a $P_\theta$-negligible set if we omit the two indicator functions involving $z$
from (6). In effect, the score function $\Delta_\theta(x)$ depends on $x$ only through the
value of the statistic $S$. As the next theorem (which is proved in Section 5)
shows, that property is equivalent to the preservation of Fisher information.
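
For the particular $g$ of Section 2 the dependence on $y$ alone can be written
out explicitly. The following worked equation is my own calculation, using only
$g'(w)/g(w) = 2/w - 1$ for $w > 0$ and the $P_\theta$-equivalences $\{z = \pm 1\} = \{\pm(y - \theta) > 0\}$:

```latex
\begin{align*}
\Delta_\theta(x)
  &= -\{z = +1\}\,\frac{g'(y-\theta)}{g(y-\theta)}
     + \{z = -1\}\,\frac{g'(\theta-y)}{g(\theta-y)} \\
  &= \{y > \theta\}\Bigl(1 - \frac{2}{y-\theta}\Bigr)
     - \{y < \theta\}\Bigl(1 - \frac{2}{\theta-y}\Bigr)
     \qquad \text{a.s.}[P_\theta] \\
  &= \operatorname{sgn}(y-\theta)\Bigl(1 - \frac{2}{|y-\theta|}\Bigr),
\end{align*}
% and the second moment recovers the constant in (1):
\[
P_\theta \Delta_\theta^2
  = \int_0^\infty \Bigl(1 - \frac{2}{w}\Bigr)^2 \tfrac12 w^2 e^{-w}\,dw
  = \int_0^\infty \tfrac12 (w-2)^2 e^{-w}\,dw = I .
\]
```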

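Theorem 7. Suppose $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ on $(\mathcal{X}, \mathcal{A})$ is DQM with score
function $\Delta_\theta$. Suppose $S$ is a measurable map from $(\mathcal{X}, \mathcal{A})$ into $(\mathcal{Y}, \mathcal{B})$ and
$Q_\theta = SP_\theta$ is the distribution of $S$ under $P_\theta$. Then:

(i) The statistical experiment $\mathcal{Q} = \{Q_\theta : \theta \in \Theta\}$ is also DQM, with score
function $\widetilde{\Delta}_\theta(y) = P_\theta(\Delta_\theta \mid S = y)$.
(ii) At each fixed $\theta$, Fisher information is preserved (that is, $I_{\mathcal{P}}(\theta) = I_{\mathcal{Q}}(\theta)$)
if and only if $\Delta_\theta(x) = \widetilde{\Delta}_\theta(Sx)$ a.s.$[P_\theta]$.

With only notational changes, the Theorem extends to the case where $\Theta$
is a subset of some Euclidean space; no extra conceptual difficulties arise in
higher dimensions.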

Credit where credit is due. The results stated in Theorem 7 have an
interesting history. Property (i) was asserted ("Direct calculations show that
the function $q^{1/2}(y; \theta)$ is differentiable in $L_2(\nu)$ and possess a continuous
derivative . . . ") in Theorem 7.2 of Ibragimov and Has'minskii (1981,
Chapter I, page 70), an English translation from the 1979 Russian edition.
However, that Theorem also (incorrectly, as noted by K&S) asserted that Fisher
information is preserved if and only if $S$ is sufficient.

Pitman (1979, pages 19-21) established differentiability in mean, a prop- 
erty slightly different from (i), in order to deduce a result equivalent to (ii). 

Le Cam and Yang (1988, Section 7) deduced an analogue of (i) (preserva- 
tion of DQM under restriction to sub-sigma-fields) by an indirect argument 
using equivalence of DQM with the existence of a quadratic approximation 
to likelihood ratios of product measures (an LAN condition). 

Bickel et al. (1993, page 461) proved result (i), citing Ibragimov and
Has'minskii (1981), Le Cam and Yang (1988), and van der Vaart (1988,
Appendix A3) for earlier proofs. The last of these was a revised ("I have not
resisted the temptation to rewrite numerous parts of the original manuscript")
version of van der Vaart's 1987 Ph.D. thesis. He cited Le Cam and Yang
(1988) and a manuscript version of Bickel et al. (1993).

4. Large sample interpretation. The example in Section 2 shows
that, for a sample $x = (y, z)$ of size one from $P_\theta$, some "statistical
information" about $\theta$ (namely, whether $\theta < y$ or $\theta > y$) is lost if we discard $z$.
The loss is not detected by Fisher's measure of information. An asymptotic
analysis, based on a sample of size $n$ from $P_\theta$, sheds a little light on why
the $z$ contribution is relatively unimportant.

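Write $\mathbb{P}_{\theta,n}$ and $\mathbb{Q}_{\theta,n}$ for the $n$-fold product measures $P_\theta^n$ and $Q_\theta^n$, with
the probability measures $P_\theta$ and $Q_\theta$ as in Section 2. That is, the statistical
experiment $\mathcal{P}_n = \{\mathbb{P}_{\theta,n} : \theta \in \Theta\}$ corresponds to taking $n$ independent
observations $x_1 = (y_1, z_1), \dots, x_n = (y_n, z_n)$ from $P_\theta$ and $\mathcal{Q}_n = \{\mathbb{Q}_{\theta,n} : \theta \in \Theta\}$
corresponds to $y_1, \dots, y_n$.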

Both $\mathcal{P}_n$ and $\mathcal{Q}_n$ are locally asymptotically normal (Le Cam and Yang,
2000, Chapter 6). They share the same local normal approximations because
they have the same score functions and (hence) the same Fisher information
functions: for each fixed $\theta$ and each finite subset $T$ of the real line, the "local
experiments"

$$\{\mathbb{P}_{\theta + t n^{-1/2},\, n} : t \in T\} \qquad \text{and} \qquad \{\mathbb{Q}_{\theta + t n^{-1/2},\, n} : t \in T\}$$

are asymptotically equivalent in the (weak) Le Cam sense. The deficiency
distance (Le Cam and Yang, 2000, Section 2.2) between these two local
experiments tends to zero as $n$ tends to infinity. The local asymptotic
equivalence of $\mathcal{P}_n$ and $\mathcal{Q}_n$ has many consequences. For example, classical theory
establishes existence of many different estimators $\hat\theta_n = \hat\theta_n(x_1, \dots, x_n)$ for
which $\sqrt{n}(\hat\theta_n - \theta)$ converges in distribution under $\mathbb{P}_{\theta,n}$ to $N(0, I^{-1})$, and
many different estimators $\theta^*_n = \theta^*_n(y_1, \dots, y_n)$ for which $\sqrt{n}(\theta^*_n - \theta)$
converges in distribution under $\mathbb{Q}_{\theta,n}$ to the same $N(0, I^{-1})$. As shown by the
Hajek-Le Cam convolution and asymptotic minimax theorems (Bickel et al.,
1993, Section 2.3), there are various senses in which the $N(0, I^{-1})$ limit is the
best we can hope to achieve for either experiment. Asymptotically speaking,
the $z_i$'s must be contributing at a less important level.

Remark. Except for the purpose of the root-n asymptotics, perhaps 
we should agree with Basu (1975, Section 5) that the Fisher information 
function is a "mathematically interesting but statistically rather fruitless
notion".

Among $y_1, \dots, y_n$, write $y_{L:n}$ for the largest $y_i$ for which $z_i = -1$ and $y_{R:n}$
for the smallest $y_i$ for which $z_i = +1$. Each $z_i$ tells us whether $y_i > \theta$ or $y_i < \theta$,
implying

(8) $$y_{L:n} < \theta < y_{R:n} \qquad \text{with } \mathbb{P}_{\theta,n} \text{ probability one.}$$

The $w^2$ decay in $g(w)$ at zero implies that both $\theta - y_{L:n}$ and $y_{R:n} - \theta$ are
decreasing at an $n^{-1/3}$ rate. In fact both $n^{1/3}(\theta - y_{L:n})$ and $n^{1/3}(y_{R:n} - \theta)$
have nontrivial limit distributions under $\mathbb{P}_{\theta,n}$. For example, for each $s > 0$
direct calculation from (2), using $\nu\{+1\} = 1/2$ and $\int_0^\delta g(w)\,dw = \delta^3/6 + O(\delta^4)$,
shows that $P_\theta\{\theta < y < \theta + s n^{-1/3}\} = s^3/(12n) + o(1/n)$, so that

$$\mathbb{P}_{\theta,n}\{n^{1/3}(y_{R:n} - \theta) > s\} = \mathbb{P}_{\theta,n}\{\text{no } y_i\text{'s in } (\theta, \theta + s n^{-1/3})\} \to \exp(-s^3/12).$$
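
The limit can be checked without simulation, because the probability that no
$y_i$ lands in $(\theta, \theta + s n^{-1/3})$ is an exact $n$-th power. The sketch below is
my own verification under the Section 2 density: a single observation falls in
the interval with probability $\nu\{+1\}\int_0^\delta g(w)\,dw = \delta^3/12 + O(\delta^4)$, which gives the
limiting value $\exp(-s^3/12)$. Function names are illustrative.

```python
import math

def g_mass_below(d):
    """Closed-form integral of g(w) = 0.5 * w^2 * exp(-w) over (0, d), for d >= 0."""
    return 1.0 - math.exp(-d) * (1.0 + d + 0.5 * d * d)

def prob_no_point_in_gap(s, n):
    """Exact probability that none of y_1..y_n lies in (theta, theta + s * n^(-1/3))."""
    d = s * n ** (-1.0 / 3.0)
    p_hit = 0.5 * g_mass_below(d)          # requires z = +1 (weight 1/2) and w < d
    return math.exp(n * math.log1p(-p_hit))

s = 1.0
for n in (10**3, 10**5, 10**7):
    print(n, prob_no_point_in_gap(s, n), math.exp(-s**3 / 12.0))
```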

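For any $\sqrt{n}$-consistent estimator $\hat\theta_n$ the event $A_n = \{y_{L:n} < \hat\theta_n < y_{R:n}\}$
has $\mathbb{P}_{\theta,n}$ probability that tends very rapidly to one. Put another way,

$$\mathbb{P}_{\theta,n}\{\exists\, i \le n : \theta < y_i < \hat\theta_n \text{ or } \hat\theta_n < y_i < \theta\} \to 0.$$

With high probability, what we learn from the $z_i$'s just duplicates what we
usually can learn from the $y_i$'s.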

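To make the idea more concrete, define $z^*_{i,n} = \operatorname{sgn}(y_i - \hat\theta_n)$ and
$x^*_{i,n} = (y_i, z^*_{i,n})$. That is,

$$z^*_{i,n} = \begin{cases} +1 & \text{if } y_i > \hat\theta_n \\ -1 & \text{if } y_i < \hat\theta_n \end{cases}$$

On the event $A_n$ we have $x_i = x^*_{i,n}$ for $i = 1, \dots, n$. If $\mathbb{P}^*_{\theta,n}$ denotes the joint
distribution of $x^*_{1,n}, \dots, x^*_{n,n}$ then

$$\sup_{\theta \in \Theta} \|\mathbb{P}^*_{\theta,n} - \mathbb{P}_{\theta,n}\|_{TV} \le \sup_{\theta \in \Theta} \mathbb{P}_{\theta,n} A_n^c \to 0.$$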
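
The reconstruction of the $z_i$'s from the $y_i$'s can be sketched in a few lines.
The estimator is not specified in the text, so the code below assumes, purely
for illustration, that $\hat\theta_n$ is the sample mean of the $y_i$ (which has mean $\theta$ and
variance $12/n$ under $\mathbb{P}_{\theta,n}$, because $zw$ has mean zero and second moment 12).
The seed, sample size, and $\theta$ are arbitrary.

```python
import random

def sample_experiment(theta, n, rng):
    """Draw n independent observations (y_i, z_i) from P_theta."""
    ys, zs = [], []
    for _ in range(n):
        z = rng.choice((-1, +1))
        ys.append(theta + z * rng.gammavariate(3.0, 1.0))  # w ~ Gamma(shape 3, scale 1)
        zs.append(z)
    return ys, zs

def reconstruct_signs(ys, theta_hat):
    """z*_i = sgn(y_i - theta_hat), with ties (a null event) sent to -1."""
    return [+1 if y > theta_hat else -1 for y in ys]

rng = random.Random(1)             # fixed seed, chosen arbitrarily
theta, n = 1.5, 2000
ys, zs = sample_experiment(theta, n, rng)
theta_hat = sum(ys) / n            # assumed estimator: the sample mean of the y_i
z_star = reconstruct_signs(ys, theta_hat)

# The reconstruction fails only for observations trapped between theta and theta_hat.
lo, hi = min(theta, theta_hat), max(theta, theta_hat)
mismatches = [i for i in range(n) if zs[i] != z_star[i]]
trapped = [i for i in range(n) if lo < ys[i] <= hi]
print(len(mismatches), mismatches == trapped)
```

The event $A_n$ is exactly the event that `mismatches` is empty. A better
estimator than the sample mean would shrink the trapping interval further.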

In the terminology of Le Cam's theory for convergence of statistical
experiments, $\mathcal{P}_n$ and $\mathcal{Q}_n$ are asymptotically equivalent, not just locally
asymptotically equivalent in the weak sense. The vector $(y_1, \dots, y_n)$ is asymptotically
sufficient for $\mathcal{P}_n$, in Le Cam's sense. The map $(y_1, \dots, y_n) \mapsto (x^*_{1,n}, \dots, x^*_{n,n})$
defines a Le Cam transition (Le Cam and Yang, 2000, Theorem 2.2) that
bounds the deficiency $\delta(\mathcal{Q}_n, \mathcal{P}_n)$.

Put another way, for every statistic $\psi_n(x_1, \dots, x_n)$ for $\mathcal{P}_n$ there is
another statistic $\psi^*_n(y_1, \dots, y_n) = \psi_n(x^*_{1,n}, \dots, x^*_{n,n})$ for $\mathcal{Q}_n$ that has the same
asymptotic behavior.

Remark. Rough calculations suggest that the Le Cam distance
between $\mathcal{P}_n$ and $\mathcal{Q}_n$ tends to zero like $\exp(-C n^{1/3})$ for some constant $C$.
I omit the details because the actual rate is not important for the story
I am telling.

5. Proof of Theorem 7. Recall that the Kolmogorov conditional
expectation $P_\theta(\cdot \mid S = y)$ is abstractly defined, via the Radon-Nikodym
theorem, as an increasing linear map (depending on $\theta$) $\kappa : L^1(P_\theta) \to L^1(Q_\theta)$
with properties analogous to those enjoyed by a Markov kernel. If we
identify an $f$ in $L^1(P_\theta)$ with the (signed) measure $\mu_f$ for which $d\mu_f/dP_\theta = f$,
then $g = \kappa f$ is the density of $S\mu_f$ with respect to $Q_\theta$. To stress the analogy
with Markov kernels I will write $K_y f$, or even $K_y f(x)$, instead of $(\kappa f)(y)$.
Thus the defining property of $\kappa$ can be rewritten as

(9) $$Q_\theta f_1(y) K_y f_2 = P_\theta f_1(Sx) f_2(x)$$

for measurable real functions $f_1$ on $\mathcal{Y}$ and $f_2$ on $\mathcal{X}$, at least when $f_1(Sx) f_2(x)$
is $P_\theta$-integrable. A reader who chose to interpret $K_y$ as a Markov kernel would
lose only a tiny amount of generality.

Of course if one regards $\kappa$ as acting on the function space $\mathcal{L}^1(P_\theta)$, instead of on
the space $L^1(P_\theta)$ of $P_\theta$-equivalence classes, then one should qualify assertions
with the occasional a.s.$[P_\theta]$ caveats and regard $\kappa f$ as being defined only up
to $Q_\theta$ equivalence. Following the usual custom, I will omit such qualifiers.

Proof of assertion (i). The following argument is adapted from van der
Vaart (1988, Appendix A3).

To simplify notation, I will prove that $\mathcal{Q}$ is DQM only at $\theta = 0$, writing
$P^\perp_t$ instead of $P^\perp_{t,0}$ and $p_t$ instead of $p_{t,0}$. Keep in mind that $K_y$ now denotes
the conditional expectation operator $P_0(\cdot \mid S = y)$. For each function $h(x)$
in $\mathcal{L}^2(P_0)$ I will write $\tilde h(y)$ for its conditional mean $K_y h(x)$ and

$$\operatorname{var}_y h := K_y \left( h(x) - \tilde h(y) \right)^2 = K_y h(x)^2 - \tilde h(y)^2$$

for its conditional variance.

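Start with the simplest case where $P_t$ is actually dominated by $P_0$. Then

$$\xi_t(x) := \sqrt{dP_t/dP_0} = 1 + \tfrac{1}{2} t \Delta_0(x) + r_t(x) \qquad \text{with } P_0 r_t^2 = o(t^2)$$

and

(10) $$\tilde\xi_t(y) := K_y \xi_t(x) = 1 + \tfrac{1}{2} t \widetilde\Delta_0(y) + \tilde r_t(y) \qquad \text{with } Q_0 \tilde r_t^2 \le P_0 r_t^2 = o(t^2),$$

and, by the Radon-Nikodym property,

$$\eta_t(y) := \sqrt{dQ_t/dQ_0} = \sqrt{K_y \xi_t(x)^2}.$$

The proof of assertion (i) will work by showing that the difference $\delta_t(y) :=
\eta_t(y) - \tilde\xi_t(y)$ is small, in the sense that $Q_0 \delta_t^2 = o(t^2)$. For then we will have

$$\eta_t(y) = 1 + \tfrac{1}{2} t \widetilde\Delta_0(y) + [\tilde r_t(y) + \delta_t(y)] \qquad \text{with } Q_0 [\tilde r_t(y) + \delta_t(y)]^2 = o(t^2),$$

which implies DQM for $\mathcal{Q}$ at 0.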

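The desired property for $\delta_t$ will be derived from the following three facts
about the conditional variance

(11) $$\sigma_t^2(y) := \operatorname{var}_y(\xi_t) = K_y \xi_t(x)^2 - \tilde\xi_t(y)^2 = \eta_t(y)^2 - \tilde\xi_t(y)^2.$$

(a) The representation $\sigma_t^2(y) = K_y \left( \xi_t(x) - \tilde\xi_t(y) \right)^2$ gives

$$\begin{aligned}
\sigma_t^2(y) &= K_y \left( \tfrac{1}{2} t\, [\Delta_0(x) - \widetilde\Delta_0(y)] + [r_t(x) - \tilde r_t(y)] \right)^2 \\
&\le 2 \left( \tfrac{1}{2} t \right)^2 K_y \left[ \Delta_0(x) - \widetilde\Delta_0(y) \right]^2 + 2 K_y \left[ r_t(x) - \tilde r_t(y) \right]^2 \\
&\le \tfrac{1}{2} t^2 K_y \Delta_0^2 + 2 K_y r_t^2 .
\end{aligned}$$

Remark. The cancellation of the leading 1 when $\tilde\xi_t$ is subtracted from $\xi_t$
seems to be vital to the proof. For general Hellinger differentiability, the
cancellation would not occur.

(b) $\delta_t(y) \ge 0$ because $\eta_t(y)^2 - \tilde\xi_t(y)^2 = \sigma_t^2(y) \ge 0$.

(c) Substitution of $\delta_t + \tilde\xi_t$ for $\eta_t$ in (11) gives

$$\sigma_t^2(y) = 2 \delta_t(y) \tilde\xi_t(y) + \delta_t(y)^2 .$$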
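The rest is easy. For each $\epsilon > 0$ define

$$A_{t,\epsilon} := \{ y \in \mathcal{Y} : \tilde\xi_t(y) \ge \tfrac{1}{2},\ \sigma_t(y) \le \epsilon \}.$$

Integration of inequality (a) gives

$$Q_0 \sigma_t^2(y) \le \tfrac{1}{2} t^2 P_0 \Delta_0^2 + 2 P_0 r_t^2 = O(t^2) + o(t^2) \le C t^2 \qquad \text{for some constant } C,$$

which, together with (10), implies $Q_0 A_{t,\epsilon} \to 1$ as $t \to 0$.

On the set $A_{t,\epsilon}$ equality (c) ensures that $\delta_t(y) \le \sigma_t^2(y) \le \epsilon\, \sigma_t(y)$; on $A_{t,\epsilon}^c$
the nonnegativity of $\delta_t$ and equality (c) give $\delta_t^2 \le \sigma_t^2$. Thus

$$\begin{aligned}
Q_0 \delta_t(y)^2 &\le \epsilon^2 Q_0 \sigma_t^2(y) \{ y \in A_{t,\epsilon} \} + Q_0 \sigma_t^2(y) \{ y \notin A_{t,\epsilon} \} \\
&\le \epsilon^2 C t^2 + \tfrac{1}{2} t^2 Q_0 K_y \Delta_0^2 A_{t,\epsilon}^c + 2 Q_0 K_y r_t^2 \qquad \text{by (a).}
\end{aligned}$$

The $Q_0$-integrability of $K_y \Delta_0^2$ and the Dominated Convergence theorem
imply $Q_0 K_y \Delta_0^2 A_{t,\epsilon}^c \to 0$. It follows that $Q_0 \delta_t^2 = o(t^2)$.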
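
The two pointwise bounds on $\delta_t$ used in the last display follow from (c) in one
line each. The worked steps below are my own filling-in, assuming the set $A_{t,\epsilon}$
is cut out by the conditions $\tilde\xi_t \ge 1/2$ and $\sigma_t \le \epsilon$, and using $\tilde\xi_t \ge 0$ (a conditional
mean of the nonnegative $\xi_t$) together with $\delta_t \ge 0$ from (b):

```latex
\begin{align*}
\sigma_t^2 &= \eta_t^2 - \tilde\xi_t^{\,2}
            = (\delta_t + \tilde\xi_t)^2 - \tilde\xi_t^{\,2}
            = 2\delta_t\tilde\xi_t + \delta_t^2 , \\
\text{on } A_{t,\epsilon}: \quad
\delta_t &\le 2\delta_t\tilde\xi_t \le \sigma_t^2 \le \epsilon\,\sigma_t
   && (\tilde\xi_t \ge \tfrac12,\ \delta_t \ge 0,\ \sigma_t \le \epsilon), \\
\text{on } A_{t,\epsilon}^c: \quad
\delta_t^2 &\le 2\delta_t\tilde\xi_t + \delta_t^2 = \sigma_t^2
   && (\tilde\xi_t \ge 0,\ \delta_t \ge 0).
\end{align*}
```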

Finally, what happens when $P_t$ is not dominated by $P_0$? The analysis
for $\xi_t^2$, the density of the part of $P_t$ that is dominated by $P_0$, is the same
as before. The image measure $SP^\perp_t$ has total mass of order $o(t^2)$, part of
which gets absorbed into $Q^\perp_t$. The part of $SP^\perp_t$ that is dominated by $Q_0$
contributes an extra nonnegative term, $\gamma_t(y)$, to the density of the absolutely
continuous part of $Q_t$ with respect to $Q_0$. The $\eta_t^2(y)$ becomes $K_y \xi_t^2 + \gamma_t(y)$.
The extra term causes little trouble because

$$\sqrt{K_y \xi_t^2} \le \eta_t \le \sqrt{K_y \xi_t^2} + \sqrt{\gamma_t} \qquad \text{and} \qquad Q_0 \gamma_t = o(t^2).$$


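Proof of assertion (ii). Write $\mathbb{H}$ for the closed subspace of $L^2(P_\theta)$
consisting of (equivalence classes of) functions measurable with respect to the
sub-sigma-field of $\mathcal{A}$ generated by $S$. Each member of $\mathbb{H}$ is of the form $f(Sx)$ for
some $f$ in $L^2(Q_\theta)$. The orthogonal projection of $\Delta_\theta$ onto $\mathbb{H}$ equals $\widetilde\Delta_\theta(Sx)$.
Thus

$$I_{\mathcal{P}}(\theta) = P_\theta \Delta_\theta(x)^2 = P_\theta \widetilde\Delta_\theta(Sx)^2 + P_\theta \left[ \Delta_\theta(x) - \widetilde\Delta_\theta(Sx) \right]^2 .$$

The first term on the right-hand side equals $Q_\theta(\widetilde\Delta_\theta^2) = I_{\mathcal{Q}}(\theta)$; the last term is
zero if and only if $\Delta_\theta(x) = \widetilde\Delta_\theta(Sx)$ a.s.$[P_\theta]$.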
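
The Pythagorean decomposition is easy to see on a finite sample space, where
conditional expectations are plain weighted averages. The sketch below is a toy
of my own devising, not the K&S example: it assumes a one-parameter family
$p_\theta(x) \propto \exp(\theta a_x)$ on a finite set, for which the score is $a_x$ minus its mean,
and a statistic $S$ given as a list of labels. All names are hypothetical.

```python
import math

def fisher_split(theta, a, S):
    """Return (I_P, I_Q, lost) with I_P = I_Q + lost, for p_theta(x) ~ exp(theta * a[x])."""
    weights = [math.exp(theta * ax) for ax in a]
    total = sum(weights)
    p = [w / total for w in weights]
    mean_a = sum(pi * ai for pi, ai in zip(p, a))
    score = [ai - mean_a for ai in a]              # d/dtheta log p_theta(x)

    q, num = {}, {}                                 # law of S and unnormalised conditional means
    for pi, sc, y in zip(p, score, S):
        q[y] = q.get(y, 0.0) + pi
        num[y] = num.get(y, 0.0) + pi * sc
    cond = {y: num[y] / q[y] for y in q}            # conditional mean of the score given S = y

    info_P = sum(pi * sc ** 2 for pi, sc in zip(p, score))
    info_Q = sum(q[y] * cond[y] ** 2 for y in q)
    lost = sum(pi * (sc - cond[y]) ** 2 for pi, sc, y in zip(p, score, S))
    return info_P, info_Q, lost

# S merges points with different scores: information is lost.
coarse = fisher_split(0.3, a=[0, 1, 2, 3], S=[0, 0, 1, 1])
# The score is constant on each fibre of S: nothing is lost.
matched = fisher_split(0.3, a=[0, 0, 1, 1], S=[0, 0, 1, 1])
print(coarse, matched)
```

In the second call $S$ happens to be sufficient as well, so this toy illustrates
only the projection identity; the point of the K&S example is that the residual
term can vanish without sufficiency.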

Acknowledgements. Many thanks to the referee for pointing out the 
connections with the work of Basu. In retrospect, it is a little surprising 
that I was unable to find a counterexample analogous to that of K&S in the 
volume (DasGupta, 2011) of Basu's selected works. 